Fix GPU Hang in Gemma4 and add metrics by solderzzc · Pull Request #40 · SharpAI/mlx-swift-lm

solderzzc · 2026-05-12T21:21:19Z

Pushes the final GPU hang fix and the MTP metric propagation that were missed before the main branch merge.


Copilot

Pull request overview

This PR aims to (1) prevent a GPU hang in Gemma4’s sparse masked-embedder logit projection by removing a per-row scatter loop, and (2) propagate speculative/MTP draft-token metrics into the generation completion info so callers can observe acceptance rates.

Changes:

Adds acceptedDraftTokens / totalDraftTokens tracking to the common token-iterator interface and includes them in GenerateCompletionInfo emitted at the end of async generation.
Updates SpeculativeTokenIterator to accumulate accepted/total draft-token counts per speculation round.
Replaces a CPU-style per-row scatter loop in Gemma4Text.maskedEmbedderLogits with a vectorized advanced-index assignment.
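
As a rough illustration of the metric described above, the sketch below accumulates accepted and total draft-token counts per speculation round and derives an acceptance rate. Only the names `acceptedDraftTokens` and `totalDraftTokens` come from this PR; the `DraftTokenMetrics` type, `recordRound`, and `acceptanceRate` are hypothetical and are not claimed to match what `Evaluate.swift` actually does.

```swift
/// Hypothetical container for illustration; not the type used in Evaluate.swift.
struct DraftTokenMetrics {
    private(set) var acceptedDraftTokens = 0
    private(set) var totalDraftTokens = 0

    /// Record one speculation round: `proposed` draft tokens were offered
    /// and `accepted` of them were kept after verification.
    mutating func recordRound(accepted: Int, proposed: Int) {
        precondition(accepted >= 0 && accepted <= proposed, "accepted must be in 0...proposed")
        acceptedDraftTokens += accepted
        totalDraftTokens += proposed
    }

    /// Fraction of draft tokens accepted, or nil if none were proposed.
    var acceptanceRate: Double? {
        totalDraftTokens > 0 ? Double(acceptedDraftTokens) / Double(totalDraftTokens) : nil
    }
}

var metrics = DraftTokenMetrics()
metrics.recordRound(accepted: 3, proposed: 4)
metrics.recordRound(accepted: 1, proposed: 4)
```

Returning `nil` when no draft tokens were proposed keeps a non-speculative run from reporting a misleading rate of zero.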

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| Libraries/MLXLMCommon/Evaluate.swift | Adds draft-token metrics to iterators and propagates them into `GenerateCompletionInfo` for async generation. |
| Libraries/MLXLLM/Models/Gemma4Text.swift | Vectorizes candidate-logit scattering into the output vocab tensor to address a GPU hang. |


+        let rowIndices = MLXArray(0 ..< Int32(B * S)).reshaped([B * S, 1])
+        output2D[rowIndices, scatterIdx2D] = selectedLogits2D
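
For readers unfamiliar with advanced-index assignment, here is a plain-Swift reference of what the two lines above are expected to compute, assuming `scatterIdx2D` and `selectedLogits2D` both have shape `[B * S, K]` and `rowIndices` broadcasts across the `K` candidates. Nested arrays stand in for `MLXArray`, and the `scatterRows` function and its `fill` value are hypothetical; this describes the semantics only, not the GPU implementation.

```swift
/// Reference semantics only; plain arrays stand in for MLXArray.
/// For each row r and candidate k: output[r][scatterIdx[r][k]] = selected[r][k].
func scatterRows(rows: Int, vocab: Int, fill: Float,
                 scatterIdx: [[Int]], selected: [[Float]]) -> [[Float]] {
    // Start from a [rows, vocab] output pre-filled with a placeholder value (assumed).
    var output = Array(repeating: Array(repeating: fill, count: vocab), count: rows)
    for r in 0..<rows {
        for (k, column) in scatterIdx[r].enumerated() {
            output[r][column] = selected[r][k]
        }
    }
    return output
}

let result = scatterRows(rows: 2, vocab: 5, fill: -Float.infinity,
                         scatterIdx: [[1, 3], [0, 4]],
                         selected: [[0.5, 1.5], [2.5, 3.5]])
```

The vectorized form expresses the same assignment as a single indexed write instead of one write per row.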


solderzzc

Great catch regarding Int32(B * S)! Using MLXArray.arange(B * S).asType(.int32) correctly bypasses the Swift 32-bit width overflow trap while keeping the MLX pipeline safe. Applied in the latest commit.
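
For context on the trap mentioned above: Swift's `Int32(_:)` initializer stops the program at runtime when the `Int` value does not fit in 32 bits. The sketch below shows a checked conversion using only the standard library. The `checkedRowCount` helper is hypothetical and is not the change applied in this PR; it only illustrates the failure mode.

```swift
/// Hypothetical helper: returns batch * sequence as Int32, or nil if it does not fit.
func checkedRowCount(batch: Int, sequence: Int) -> Int32? {
    // Guard against overflow of the Int multiplication itself.
    let (product, overflow) = batch.multipliedReportingOverflow(by: sequence)
    guard !overflow else { return nil }
    // Int32(exactly:) yields nil where Int32(product) would trap.
    return Int32(exactly: product)
}

let small = checkedRowCount(batch: 4, sequence: 2048)
let large = checkedRowCount(batch: 1 << 20, sequence: 1 << 12)
```

Here `small` holds a value and `large` is `nil`, because 2^32 exceeds `Int32.max`.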

Aegis-AI added 2 commits May 12, 2026 13:10

Fix GPU Hang in Gemma4 cluster logit projection by vectorizing 2D scatter and implement MTP acceptance rate metric extraction

b9bf50b

Merge feat/mtp-speculative-decoding to main

dd49781

Copilot AI review requested due to automatic review settings May 12, 2026 21:21

Copilot started reviewing on behalf of solderzzc May 12, 2026 21:21

Copilot AI reviewed May 12, 2026


Comment thread Libraries/MLXLLM/Models/Gemma4Text.swift Outdated

Comment on lines +1120 to +1121

let rowIndices = MLXArray(0 ..< Int32(B * S)).reshaped([B * S, 1])

output2D[rowIndices, scatterIdx2D] = selectedLogits2D

Address Copilot review: avoid Int32 overflow via MLX.arange

b42d9a0

solderzzc commented May 12, 2026


solderzzc merged commit 7c45487 into main May 12, 2026
6 checks passed

solderzzc deleted the fix/mtp-gpu-hang branch May 12, 2026 21:51

solderzzc merged 3 commits into main from fix/mtp-gpu-hang
